Autumn Wilcox – Contributed to analytic coding and content drafting, structured the project workflow, and collaborated actively via GitHub.
Namita Mishra – Developed the project plan, contributed to content drafting and analytic coding, and coordinated commits and collaboration with the group on GitHub.
# Introduction
Diabetes mellitus (DM) is a major public health concern closely associated with factors such as obesity, age, race, and gender. Identifying these associated risk factors is essential for targeted interventions (D’Angelo and Ran 2025). Traditional logistic regression, which estimates the association between risk factors and outcomes, is insufficient for analyzing complex healthcare data (DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments) (Zeger et al. 2020). Classical maximum likelihood estimation (MLE) yields unstable results in samples that are small, have missing data, or present quasi- or complete separation.
Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) allow analysis of multivariate longitudinal healthcare data with repeated measures within individuals and individuals nested in a population. By integrating prior knowledge and including exogenous (e.g., age, clinical history) and endogenous (e.g., current treatment) covariates, Bayesian models provide posterior distributions and risk predictions for conditions such as pneumonia, prostate cancer, and mental disorders. Parametric assumptions remain a limitation of these models.
Bayesian inference has shown that parametric models (with fixed parameters) often underperform compared to nonparametric models, which do not assume a prior distribution (Chatzimichail and Hatjimihail 2023). Posterior probabilities from Bayesian approaches improve disease classification and better capture heterogeneity in skewed, bimodal, or multimodal data distributions. Bayesian nonparametric models are flexible and robust, integrating multiple diagnostic tests and priors to enhance accuracy and precision, though reliance on prior information and restricted access to resources can limit applicability. Combining Bayesian methods with other statistical or computational approaches helps address systemic biases, incomplete data, and non-representative datasets.
The Bayesian framework described by R. van de Schoot et al. (2021) highlights the role of priors, data modeling, inference, posterior sampling, variational inference, and variable selection. Proper variable selection mitigates multicollinearity, overfitting, and limited sampling, improving predictive performance. Priors can be informative, weakly informative, or diffuse, and can be elicited from expert opinion, generic knowledge, or data-driven methods. Sensitivity analysis evaluates the alignment of priors with likelihoods, while MCMC simulations (e.g., brms, blavaan in R) empirically estimate posterior distributions. Spatial and temporal Bayesian models have applications in large-scale cancer genomics, identifying molecular interactions, mutational signatures, patient stratification, and cancer evolution, though temporal autocorrelation and subjective prior elicitation can be limiting.
Bayesian normal linear regression has been applied in metrology for instrument calibration using conjugate Normal–Inverse-Gamma priors (Klauenberg et al. 2015). Hierarchical priors add flexibility by modeling uncertainty across multiple levels, improving robustness and interpretability. Bayesian hierarchical/meta-analytic linear regression incorporates both exchangeable and non-exchangeable prior information, addressing multiple-testing challenges, small sample sizes, and complex relationships among regression parameters across studies (Leeuw and Klugkist 2012).
Sequential clinical reasoning models (Liu et al. 2013) demonstrate screening by adding predictors stepwise: (1) demographics, (2) metabolic components, and (3) conventional risk factors, incorporating priors and mimicking clinical evaluation. This approach captures ecological heterogeneity and improves baseline risk estimation, though untested interactions between predictors and the lack of external cross-validation remain limitations.
Bayesian multiple imputation with logistic regression addresses missing data in clinical research (Austin et al. 2021) by classifying missing values (e.g., patient refusal, loss to follow-up, mechanical errors) as MAR, MNAR, or MCAR. Multiple imputation generates plausible values across datasets and pools results for reliable classification of patient health status and mortality.
# Aims
The present study performs Bayesian logistic regression to predict diabetes status and evaluate the associations between diabetes and its predictors: body mass index (BMI), age (≥20 years), gender, and race. The study analyzes a retrospective dataset (2013–2014 NHANES survey data) based on a complex sampling design, characterized by stratification, clustering, and oversampling of specific population subgroups, rather than uniform random sampling. A Bayesian analytical approach addresses challenges posed by dataset anomalies, such as missing data, complete-case analysis, and separation, that limit the efficiency and reliability of traditional logistic regression in predicting health outcomes.
# Logical Flow for Analysis
(1) Data Preparation: preparation and analysis of the observed or imputed dataset (e.g., adult_imp1.csv) to handle missing values and ensure completeness.
(2) Train-Test Split: splitting the dataset into training (to fit the model) and testing (to evaluate model performance) sets to ensure the model is generalizable and avoids overfitting.
(3) Frequentist Approach: fit classical regression or machine learning models on the complete dataset to obtain point estimates (e.g., coefficients, odds ratios) and confidence intervals.
(4) Bayesian Approach: fit a Bayesian model on the imputed dataset to obtain posterior distributions (posterior draws) for the parameters and quantify uncertainty; use posterior draws to generate posterior predictive distributions on the complete dataset.
(5) Targeted Intervention Analysis: compare the two models; use Bayesian models to simulate or assess interventions; identify modifiable risk factors (e.g., BMI, lifestyle factors); predict how a change in a risk factor affects outcomes (e.g., diabetes risk); and quantify uncertainty in intervention effects using posterior predictive distributions.
(6) Inference & Decision-Making: combine insights from both approaches to guide data-driven decisions or public health recommendations. Frequentist results provide point estimates, while Bayesian results provide full uncertainty quantification.
# Method and Data Preparation
- The study performs Bayesian logistic regression to predict diabetes status and estimate the association of key predictors (body mass index (BMI), age, gender, and race) using the 2013–2014 NHANES dataset.
- The complex survey design of NHANES, with stratification, clustering, and oversampling, was accounted for to ensure accurate estimation.
- Posterior predictive distributions quantify the probability of diabetes for each individual, providing a flexible and robust framework for uncertainty quantification.
- Model predictions are compared against known population prevalence, enabling validation and assessment of model realism.
- By integrating prior information with the observed data, the model identifies individuals at high risk for diabetes and informs targeted public health interventions, demonstrating the utility of Bayesian methods in analyzing complex, multivariate health data.
# Statistical Tool
- R and its packages and libraries are used to import data and perform data wrangling and analysis.
# Data Source
- NHANES 2-year data (2013–2014), a cross-sectional weighted dataset (Center for Health Statistics 1999).
Data pre-processing and cleaning - Three datasets (demographics, exam, questionnaire) in .XPT format are imported into R using the haven package. After selecting variables of interest, the three weighted datasets are merged on respondent ID to create a single merged dataframe.
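A minimal sketch of this import-and-merge step, assuming the standard NHANES 2013–2014 file names (DEMO_H, BMX_H, DIQ_H); the exact files and paths used are not shown in this report:
Code
library(haven)
library(dplyr)

# Hypothetical NHANES 2013-2014 files; actual paths may differ
demo  <- read_xpt("DEMO_H.XPT")   # demographics (age, sex, race, weights, PSU, strata)
exam  <- read_xpt("BMX_H.XPT")    # body measures (BMI)
quest <- read_xpt("DIQ_H.XPT")    # diabetes questionnaire (DIQ010, DIQ050)

# Merge on the respondent sequence number SEQN
merged_data <- demo %>%
  select(SEQN, RIDAGEYR, RIAGENDR, RIDRETH1, SDMVPSU, SDMVSTRA, WTMEC2YR) %>%
  left_join(select(exam, SEQN, BMXBMI), by = "SEQN") %>%
  left_join(select(quest, SEQN, DIQ010, DIQ050), by = "SEQN")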
Response Variable: Binary
Doctor-diagnosed type 2 diabetes (excluding gestational diabetes): DIQ010 (“Doctor told you have diabetes?”) combined with DIQ050, a secondary variable describing treatment status (insulin use), used to exclude gestational cases.
Predictor Variables
Body Mass Index (factor, 4 levels plus missing, analyzed after standardization):
o Underweight (<5th percentile)
o Normal (5th–<85th percentile)
o Overweight (85th–<95th percentile)
o Obese (≥95th percentile)
o Missing - retained as its own level, since categorization provides clinically interpretable groups
Covariates:
Gender (factor, 2 levels): Male, Female
Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race (including multi-racial)
Age (number, continuous)
The merged dataset is cleaned and exploratory data analysis is conducted, with results and visualizations reported for 10,175 observations and 10 variables: race (5 levels), age (0–80 years), gender (male and female), diabetes (derived from DIQ010 and DIQ050), and BMI (continuous).
Code
# structure of the merged dataset
str(merged_data)

cat("Count with DIQ010 in {1,2}:", sum(merged_data$DIQ010 %in% c(1, 2), na.rm = TRUE), "\n")
Count with DIQ010 in {1,2}: 9578
Code
# ---- Save to file for reuse ----
dir.create("data", showWarnings = FALSE, recursive = TRUE)
saveRDS(merged_data, "data/merged_2013_2014.rds")
message("Saved: data/merged_2013_2014.rds")
# Exploratory data analysis
- Using library(survey), we calculated weighted means and SDs of all variables. BMI and age were standardized.
- Age was recoded to restrict the range to adults (20–80 years).
- BMI is recoded into categories: <18.5, 18.5–<25, 25–<30, 30–<35, 35–<40, ≥40 kg/m².
- Ethnicity is recoded as "Mexican American" = "1", "Other Hispanic" = "2", "NH White" = "3", "NH Black" = "4", "Other/Multi" = "5".
- Since special codes are not random, they cannot simply be dropped; ignoring informative missingness (MAR or MNAR) could introduce bias.
We converted the special codes (3, 7, 9) to NA and retained all NAs in the analysis. A visualization of missing data is presented below.
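A minimal sketch of one way to produce that visualization, assuming the naniar package (any missing-data plot would serve):
Code
library(naniar)

# Heatmap of missing cells across the merged variables
vis_miss(merged_data)

# Missing count per variable
gg_miss_var(merged_data)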
A final analytic dataset (‘adult’) is created, with “NH White” and “Male” as the reference groups for analysis.
Code
## ---------------- Basic Exploration (adults) ----------------
# Keep adults only and build analysis variables
adult <- merged_data %>%
  dplyr::filter(RIDAGEYR >= 20) %>%
  dplyr::transmute(
    # --- keep survey design variables so svydesign() can see them ---
    SDMVPSU, SDMVSTRA, WTMEC2YR,
    # --- outcome: DIQ010 (1 yes, 2 no; 3/7/9 -> NA) ---
    diabetes_dx = dplyr::case_when(
      DIQ010 == 1 ~ 1,
      DIQ010 == 2 ~ 0,
      DIQ010 %in% c(3, 7, 9) ~ NA_real_,
      TRUE ~ NA_real_
    ),
    # --- predictors (raw) ---
    bmi = BMXBMI,
    age = RIDAGEYR,
    # sex (1=Male, 2=Female)
    sex = forcats::fct_recode(factor(RIAGENDR), Male = "1", Female = "2"),
    # race (5-level)
    race = forcats::fct_recode(
      factor(RIDRETH1),
      "Mexican American" = "1",
      "Other Hispanic"   = "2",
      "NH White"         = "3",
      "NH Black"         = "4",
      "Other/Multi"      = "5"
    ),
    # keep DIQ050 so we can safely reference it (may be absent/NA in some rows)
    DIQ050 = DIQ050
  ) %>%
  # standardize continuous predictors
  dplyr::mutate(
    age_c = as.numeric(scale(age)),
    bmi_c = as.numeric(scale(bmi)),
    bmi_cat = cut(
      bmi,
      breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
      labels = c("<18.5", "18.5–<25", "25–<30", "30–<35", "35–<40", "≥40"),
      right = FALSE
    )
  ) %>%
  # adjust outcome: if female & DIQ050==1 ("only when pregnant"), set to 0 (not diabetes)
  dplyr::mutate(
    diabetes_dx = ifelse(sex == "Female" & !is.na(DIQ050) & DIQ050 == 1, 0, diabetes_dx)
  )

# Make NH White the reference level for race (clearer interpretation)
adult <- adult %>%
  dplyr::mutate(race = forcats::fct_relevel(race, "NH White"))

# --- sanity checks ---
cat("Adults n =", nrow(adult), "\n")
NHANES is a national survey based on a complex sampling design, oversampling certain groups (e.g., minorities, older adults) to ensure representation. It uses multistage sampling to represent the U.S. population, so we apply sampling weights, strata, and PSUs (primary sampling units) for valid estimates.
We use the survey design in regression analysis to avoid: - biased prevalence estimates (e.g., mean BMI or diabetes %) - underestimated standard errors - incorrect inference for population-level parameters.
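A minimal sketch of how the design object used later (nhanes_design_adult) could be constructed; its exact construction is not shown in this report:
Code
library(survey)

# Hypothetical construction of the design object referenced downstream;
# MEC exam weights, strata, and PSUs come from the demographics file.
nhanes_design_adult <- svydesign(
  ids     = ~SDMVPSU,    # primary sampling units
  strata  = ~SDMVSTRA,   # stratification
  weights = ~WTMEC2YR,   # 2-year MEC exam weights
  nest    = TRUE,        # PSUs are nested within strata
  data    = adult
)

# Weighted prevalence of diagnosed diabetes among adults
svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)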
Code
# data exploration
print(table(adult$diabetes_dx, useNA = "ifany"))
0 1 <NA>
4974 618 177
Code
print(table(adult$sex, useNA ="ifany"))
Male Female
2758 3011
Code
print(table(adult$race, useNA ="ifany"))
NH White Mexican American Other Hispanic NH Black
2472 767 508 1177
Other/Multi
845
Code
if (sum(!is.na(adult$diabetes_dx)) == 0) {
  stop("Too few non-missing outcomes for modeling (n = 0). Check DIQ010 upstream.")
}
# (optional plots omitted for brevity)
# save for downstream
if (!dir.exists("data")) dir.create("data", recursive = TRUE)
saveRDS(adult, "data/adult_cleaned_2013_2014.rds")
Code
ggplot(adult, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "white") +
  labs(
    title = "Distribution of Age >20 years",
    x = "Age (years)",
    y = "Count"
  ) +
  theme_minimal()
Code
ggplot(adult, aes(factor(diabetes_dx))) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Diabetes Outcome Distribution in >20 years age group",
    x = "diabetes_dx (0=No, 1=Yes)",
    y = "Count"
  )
Code
# Bars show counts per BMI category (the title previously mislabeled this as a diabetes plot)
ggplot(adult, aes(factor(bmi_cat))) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "BMI Category Distribution in >20 years age group",
    x = "bmi_cat"
  )
Code
ggplot(adult, aes(x = factor(diabetes_dx), y = bmi)) +
  geom_boxplot(fill = "skyblue") +
  labs(
    title = "BMI Distribution by Diabetes Diagnosis in >20 years age group",
    x = "Diabetes Diagnosis (0 = No, 1 = Yes)",
    y = "BMI"
  ) +
  theme_minimal()
Code
# plots for adult data: diabetes by race categories
ggplot(adult, aes(x = factor(race), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Diagnosis by Race in >20 years age group",
    x = "Race/Ethnicity",
    y = "Count",
    fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Code
ggplot(adult, aes(x = factor(bmi_cat), fill = factor(diabetes_dx))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Diagnosis by BMI in >20 years age group",
    x = "BMI",
    y = "Count",
    fill = "Diabetes Diagnosis\n(0 = No, 1 = Yes)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Statistical Methods and Modeling
# Multiple Logistic Regression on the Survey-Weighted Dataset
We conducted frequentist multiple logistic regression on the survey-weighted dataset for complete-case analysis and data exploration.
# Multivariate Imputation by Chained Equations
- We conducted MICE to manage missing data as an alternative to the Bayesian approach (Buuren and Groothuis-Oudshoorn 2011).
- Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples n > 400, even for high percentages (75%) of missing data in one variable (Van Buuren 2012).
- Multiple Imputation (MI) can be performed using the mice package in R; a sketch follows this list.
- Iterative mice imputes missing values of one variable at a time, using regression models based on the other variables in the dataset.
- In the chained process, each imputed variable becomes a predictor for subsequent imputations, and the entire process is repeated to create several complete datasets, each reflecting different possibilities for the missing data.
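A minimal sketch of the imputation step assumed to produce adult_imp1 (the exact call is not shown in this report; m, maxit, and the post-imputation derived columns are illustrative):
Code
library(mice)

# Chained-equation imputation over the analysis variables
imp <- mice(
  adult[, c("diabetes_dx", "age", "bmi", "sex", "race",
            "SDMVPSU", "SDMVSTRA", "WTMEC2YR")],
  m = 5,             # five completed datasets
  maxit = 10,        # iterations of the chained equations
  seed = 123,
  printFlag = FALSE
)

# Extract one completed dataset for downstream Bayesian modeling
adult_imp1 <- complete(imp, 1)

# Assumed post-imputation derivations: standardized predictors and
# normalized survey weights (scaled to mean ~1, i.e., summing to n)
adult_imp1$age_c   <- as.numeric(scale(adult_imp1$age))
adult_imp1$bmi_c   <- as.numeric(scale(adult_imp1$bmi))
adult_imp1$wt_norm <- adult_imp1$WTMEC2YR / mean(adult_imp1$WTMEC2YR)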
# Bayesian Logistic Regression
- We conduct Bayesian logistic regression to estimate the association between BMI, age, sex, and race/ethnicity and to predict doctor-diagnosed diabetes (DIQ010).
- Bayesian statistics is about updating beliefs with evidence: Posterior ∝ Likelihood × Prior.
- Prior (p(θ)): the initial belief about a parameter before seeing the data.
- Likelihood (p(y|θ)): how probable the observed data are given the parameters, derived from the model (e.g., the logistic regression likelihood).
- Posterior (p(θ|y)): the updated belief about the parameter after seeing the data.
- We selected Bayesian logistic regression because the study’s response variable is a binary outcome (diabetes: yes/no).
- Bayesian logistic regression applies Bayes’ rule to a binomial probability model and predicts the probability of the disease outcome.
- The model relates a linear predictor in age, race, BMI, and gender to the outcome (diabetes), and assumes observations are independent given the predictors.
- Regression of a binary variable (0 or 1) uses a Bernoulli probability model that classifies the categorical response, here predicting diabetes.
- The logit link converts the linear predictor into probabilities for the response variable.
Prior - We used a student_t(3, 0, 10) prior (R. van de Schoot et al. 2013). student_t(ν, μ, σ) is Student’s t-distribution with:
- ν = 3: degrees of freedom (controls tail heaviness)
- μ = 0: location (center, like the mean)
- σ = 10: scale (spread, like standard deviation)
student_t(3, 0, 10) is centered at 0, allows large values (± several times 10), and has heavy tails (ν = 3), accommodating outliers or unexpectedly large parameter values. This helps the model remain stable, especially with:
- small datasets
- high correlation between predictors
- potential outliers in the data
In Bayesian statistics, every unknown parameter (such as a regression coefficient, mean, or variance) is treated as a random variable with a probability distribution that reflects uncertainty.
# Summary of Bayesian regression is presented under the following headings - Posterior Predictive Probabilities - Posterior Mean, Median, credible Intervals - Posterior Probability (Outcome=1) - Comparison with External Prevalence (population prevalence) - Posterior Model Fit Metrics - Prior versus Posterior Coefficient Distributions - Posterior Predictive Checks - Uncertainty Quantification
Because the outcome is binary, the fitted model is a Bernoulli logistic regression (not a linear model with Gaussian error):
\[ \operatorname{logit}\Pr(y_i = 1) = \beta_0 + \beta_1\,\mathrm{age\_c}_i + \beta_2\,\mathrm{bmi\_c}_i + \beta_3\,\mathrm{sexFemale}_i + \sum_{k} \beta_{4k}\,\mathrm{race}_{ki} \]
## Diagnostics for Bayesian Logistic Regression are presented under the following headings
Trace Plots for Markov Chain Monte Carlo Convergence
Autocorrelation Plots
Posterior Predictive Checks
Residual Analysis
Prior Sensitivity Analysis
Model Fit Assessment
Posterior Predictive Probability Plots
Posterior Interval Coverage Evaluation
Convergence Diagnostics across Chains
Code
# Modeling
library(broom)
library(mice)
library(brms)
library(posterior)
library(bayesplot)
library(knitr)

# --- Guardrails for modeling ---
n_outcome <- sum(!is.na(adult$diabetes_dx))
if (n_outcome == 0) stop("Too few non-missing outcomes for modeling. n = 0")

# Ensure factors and >=2 observed levels among complete outcomes
adult <- adult %>%
  dplyr::mutate(
    sex  = if (!is.factor(sex)) factor(sex) else sex,
    race = if (!is.factor(race)) factor(race) else race
  )
if (nlevels(droplevels(adult$sex[!is.na(adult$diabetes_dx)])) < 2)
  stop("sex has <2 observed levels after filtering; check data availability.")
if (nlevels(droplevels(adult$race[!is.na(adult$diabetes_dx)])) < 2)
  stop("race has <2 observed levels after filtering; check Data Prep.")

# Survey-weighted complete-case
# Build a logical filter on the original adult data (same length as design$data)
keep_cc <- with(
  adult,
  !is.na(diabetes_dx) & !is.na(age_c) & !is.na(bmi_c) & !is.na(sex) & !is.na(race)
)

# Subset the survey design using the logical vector (same length as original)
des_cc <- subset(nhanes_design_adult, keep_cc)

# Corresponding complete-case data (optional)
cc <- adult[keep_cc, ] |> droplevels()
cat("\nComplete-case N for survey-weighted model:", nrow(cc), "\n")
Complete-case N for survey-weighted model: 5349
Code
print(table(cc$race))
NH White Mexican American Other Hispanic NH Black
2293 713 470 1101
Other/Multi
772
Code
print(table(cc$diabetes_dx))
0 1
4752 597
Code
print(table(cc$sex))
Male Female
2551 2798
Code
form_cc <- diabetes_dx ~ age_c + bmi_c + sex + race
svy_fit <- survey::svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
summary(svy_fit)
Call:
svyglm(formula = form_cc, design = des_cc, family = quasibinomial())
Survey design:
subset(nhanes_design_adult, keep_cc)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.67143 0.11935 -22.383 1.68e-08 ***
age_c 1.10833 0.05042 21.981 1.94e-08 ***
bmi_c 0.63412 0.05713 11.099 3.88e-06 ***
sexFemale -0.63844 0.10926 -5.843 0.000386 ***
raceMexican American 0.71091 0.13681 5.196 0.000826 ***
raceOther Hispanic 0.46469 0.13474 3.449 0.008712 **
raceNH Black 0.51221 0.15754 3.251 0.011677 *
raceOther/Multi 0.84460 0.17756 4.757 0.001433 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.8455444)
Number of Fisher Scoring iterations: 6
Code
priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),
  set_prior("student_t(3, 0, 10)", class = "Intercept")
)
bayes_fit <- brm(
  formula = diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race,
  data = adult_imp1,
  family = bernoulli(link = "logit"),
  prior = priors,
  chains = 4, iter = 2000, seed = 123,
  control = list(adapt_delta = 0.95),
  refresh = 0  # quiet Stan output
)
Code
prior_summary(bayes_fit)
                prior     class                coef        source
       normal(0, 2.5)         b                               user
       normal(0, 2.5)         b               age_c   (vectorized)
       normal(0, 2.5)         b               bmi_c   (vectorized)
       normal(0, 2.5)         b raceMexicanAmerican   (vectorized)
       normal(0, 2.5)         b         raceNHBlack   (vectorized)
       normal(0, 2.5)         b     raceOtherDMulti   (vectorized)
       normal(0, 2.5)         b   raceOtherHispanic   (vectorized)
       normal(0, 2.5)         b           sexFemale   (vectorized)
  student_t(3, 0, 10) Intercept                               user
(empty columns group/resp/dpar/nlpar/lb/ub/tag omitted)
Code
summary(bayes_fit) # Bayesian model summary
Family: bernoulli
Links: mu = logit
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race
Data: adult_imp1 (Number of observations: 5592)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.66 0.09 -2.84 -2.50 1.00 4187 3510
age_c 1.09 0.06 0.97 1.22 1.00 3012 3098
bmi_c 0.63 0.05 0.53 0.72 1.00 3472 3315
sexFemale -0.66 0.10 -0.86 -0.46 1.00 4003 3052
raceMexicanAmerican 0.69 0.18 0.35 1.04 1.00 3526 2843
raceOtherHispanic 0.43 0.24 -0.07 0.89 1.00 4058 3114
raceNHBlack 0.54 0.15 0.24 0.82 1.00 3597 3177
raceOtherDMulti 0.82 0.19 0.45 1.19 1.00 3763 3257
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Bayesian Logistic Regression
Bayesian logistic regression is performed on the pooled imputed adult dataset (adult_imp1); the Bayesian model summary statistics, visualizations, and summary tables follow.
Code
# Summaries of numeric predictors
numeric_vars <- sapply(adult_imp1, is.numeric)
summary(adult_imp1[, numeric_vars])
diabetes_dx age bmi WTMEC2YR
Min. :0.0000 Min. :20.00 Min. :14.1 Min. : 0
1st Qu.:0.0000 1st Qu.:34.00 1st Qu.:24.1 1st Qu.: 18785
Median :0.0000 Median :48.00 Median :27.7 Median : 27803
Mean :0.1105 Mean :48.84 Mean :29.0 Mean : 39779
3rd Qu.:0.0000 3rd Qu.:63.00 3rd Qu.:32.4 3rd Qu.: 52293
Max. :1.0000 Max. :80.00 Max. :82.9 Max. :171395
SDMVPSU SDMVSTRA age_c bmi_c
Min. :1.000 Min. :104.0 Min. :-1.65751 Min. :-2.09476
1st Qu.:1.000 1st Qu.:107.0 1st Qu.:-0.86038 1st Qu.:-0.69669
Median :1.000 Median :111.0 Median :-0.06326 Median :-0.19338
Mean :1.486 Mean :110.9 Mean :-0.01563 Mean :-0.01133
3rd Qu.:2.000 3rd Qu.:115.0 3rd Qu.: 0.79079 3rd Qu.: 0.46371
Max. :2.000 Max. :118.0 Max. : 1.75873 Max. : 7.52398
wt_norm
Min. :0.0000
1st Qu.:0.4729
Median :0.7000
Mean :1.0014
3rd Qu.:1.3165
Max. :4.3150
Code
# Summaries of factor predictors
factor_vars <- sapply(adult_imp1, is.factor)
summary(adult_imp1[, factor_vars])
sex race
Male :2669 NH White :2398
Female:2923 Mexican American: 742
Other Hispanic : 489
NH Black :1141
Other/Multi : 822
Code
# Save your dataset as CSV
write.csv(adult_imp1, "adult_imp1.csv", row.names = FALSE)
# Training and Testing
To evaluate how well the model learned patterns from the training dataset, and how reliably it will perform on new data in real-world scenarios, we performed a train/test split on the imputed data.
# Testing set to evaluate model performance on unseen data
- Model accuracy / performance
- Generalizability - helps check for overfitting or underfitting
- Reliability of predictions - confidence in model predictions for new individuals
- Distribution and bias check - ensures the training and testing sets are representative of the same population
Results from training and test data- Training (n=4473) & Testing (n=1119) sets: Age ~48 yrs, BMI ~28, bmi_c ~0, wt_norm ~1; ranges similar across sets.
Code
# Code here
import pandas as pd

# Load the data exported from R
adult_data_py = pd.read_csv("adult_imp1.csv")  # your R dataset saved as CSV

# Now you can rename it freely in Python
df = adult_data_py  # just create a new variable pointing to the same DataFrame

# Check the data
df.head()
diabetes_dx age bmi sex ... SDMVSTRA age_c bmi_c wt_norm
0 1 69 26.7 Male ... 112 1.132418 -0.333192 0.339392
1 1 54 28.6 Male ... 108 0.278360 -0.067558 0.616088
2 1 72 28.9 Male ... 109 1.303230 -0.025616 1.439868
3 0 73 19.7 Female ... 116 1.360167 -1.311843 1.650048
4 0 56 41.7 Male ... 111 0.392234 1.763918 0.638072
[5 rows x 11 columns]
Code
import pandas as pd
from sklearn.model_selection import train_test_split

# Optional: rename columns if needed
df = df.rename(columns={"diabetes_dx": "diabetes_status"})

# Separate predictors (X) and outcome (y)
X = df.drop(columns=["diabetes_status"])
y = df["diabetes_status"]

# Create 80/20 train-test split, stratifying by outcome to keep class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)

# Combine X and y if you want full train/test datasets
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

# Check sizes
print("Training set:", train_data.shape)
Training set: (4473, 11)
Code
print("Testing set: ", test_data.shape)
Testing set: (1119, 11)
Code
# Check class balance
print("\nTraining set class distribution:\n", y_train.value_counts(normalize=True))
Training set class distribution:
diabetes_status
0 0.88956
1 0.11044
Name: proportion, dtype: float64
Code
print("\nTesting set class distribution:\n", y_test.value_counts(normalize=True))
Testing set class distribution:
diabetes_status
0 0.889187
1 0.110813
Name: proportion, dtype: float64
Code
####
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Combine train and test class distributions into a DataFrame
# (use value_counts' own index so classes and proportions stay aligned)
train_counts = y_train.value_counts(normalize=True)
test_counts = y_test.value_counts(normalize=True)
train_dist = pd.DataFrame({'Class': train_counts.index,
                           'Proportion': train_counts.values,
                           'Set': 'Train'})
test_dist = pd.DataFrame({'Class': test_counts.index,
                          'Proportion': test_counts.values,
                          'Set': 'Test'})
dist_df = pd.concat([train_dist, test_dist])

# Plot
plt.figure(figsize=(8, 5))
sns.barplot(x='Class', y='Proportion', hue='Set', data=dist_df)
plt.title('Class Distribution in Training and Testing Sets')
plt.ylabel('Proportion')
plt.xlabel('Diabetes Status')
plt.ylim(0, 1)
(0.0, 1.0)
Code
plt.show()
Code
# Numeric summaries for predictors
print("Training set numeric summary:\n", X_train.describe())
Training set numeric summary:
age bmi ... bmi_c wt_norm
count 4473.000000 4473.000000 ... 4473.000000 4473.000000
mean 48.926895 28.933736 ... -0.020899 1.001234
std 17.571583 6.935321 ... 0.969609 0.786057
min 20.000000 14.100000 ... -2.094764 0.000000
25% 34.000000 24.100000 ... -0.696691 0.477152
50% 48.000000 27.700000 ... -0.193384 0.700259
75% 63.000000 32.300000 ... 0.449729 1.286732
max 80.000000 77.500000 ... 6.769021 4.314957
[8 rows x 8 columns]
Code
print("\nTesting set numeric summary:\n", X_test.describe())
# Bayesian Logistic Regression Analysis
Code
###
library(ggplot2)
library(tibble)

# Prior draws for two coefficients (age and bmi); prior = N(0, 2.5)
# Columns named to match the posterior draws frame combined with these later
# (term, estimate, type)
prior_draws <- tibble(
  term = rep(c("Age (per 1 SD)", "BMI (per 1 SD)"), each = 4000),
  estimate = c(rnorm(4000, 0, 2.5), rnorm(4000, 0, 2.5)),
  type = "Prior"
)
ggplot(prior_draws, aes(x = estimate, fill = term)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(title = "Prior Distributions for Coefficients",
       x = "Coefficient Value", y = "Density") +
  scale_fill_manual(values = c("skyblue", "orange"))
Code
# Diabetes vs BMI
library(ggplot2)

# Create the plot
ggplot(adult_imp1, aes(x = factor(diabetes_dx), y = bmi, fill = factor(diabetes_dx))) +
  geom_boxplot(alpha = 0.7) +
  scale_x_discrete(labels = c("0" = "No Diabetes", "1" = "Diabetes")) +
  labs(
    x = "Diabetes Diagnosis",
    y = "BMI",
    title = "BMI Distribution by Diabetes Status"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Code
# logistic regression curve
ggplot(adult_imp1, aes(x = bmi, y = diabetes_dx)) +
  geom_point(alpha = 0.2, position = position_jitter(height = 0.02)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = TRUE, color = "blue") +
  labs(
    x = "BMI",
    y = "Probability of Diabetes",
    title = "Predicted Probability of Diabetes vs BMI"
  ) +
  theme_minimal()
# Posterior Draws
Once we have posterior draws, we calculate summary statistics - mean, median, and 95% credible intervals - via summary(bayes_fit) or posterior_summary(bayes_fit).
# Visualization
Distribution shape: mcmc_hist(posterior, pars = c("b_age_c")) or mcmc_areas()
Pairwise plots: correlations between parameters via mcmc_pairs(posterior)
Posterior predictive checks: compare model predictions vs observed data via pp_check(bayes_fit)
# Assumptions for Bayesian logistic regression - posterior predictive checks for fit and linearity - MCMC trace plots for convergence - bayes_R2 for model fit. A minimal sketch of these checks follows.
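A sketch of the convergence and fit checks listed above, assuming bayesplot defaults; parameter names follow the fitted model:
Code
library(bayesplot)
library(posterior)

# Trace plots: well-mixed, stationary, overlapping chains indicate convergence
mcmc_trace(as_draws_array(bayes_fit), pars = c("b_age_c", "b_bmi_c"))

# Posterior predictive check: do simulated outcomes resemble the observed 0/1 data?
pp_check(bayes_fit, ndraws = 100)  # argument is 'nsamples' in older brms versions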
Code
summary(bayes_fit)
Family: bernoulli
Links: mu = logit
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race
Data: adult_imp1 (Number of observations: 5592)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept -2.66 0.09 -2.84 -2.50 1.00 4187 3510
age_c 1.09 0.06 0.97 1.22 1.00 3012 3098
bmi_c 0.63 0.05 0.53 0.72 1.00 3472 3315
sexFemale -0.66 0.10 -0.86 -0.46 1.00 4003 3052
raceMexicanAmerican 0.69 0.18 0.35 1.04 1.00 3526 2843
raceOtherHispanic 0.43 0.24 -0.07 0.89 1.00 4058 3114
raceNHBlack 0.54 0.15 0.24 0.82 1.00 3597 3177
raceOtherDMulti 0.82 0.19 0.45 1.19 1.00 3763 3257
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
# Estimated R² - Estimated R² = 0.13 (13.1%) suggests the predictors are relevant, but most of the variability remains unexplained. - Other factors (genetics, lifestyle, environment, etc.) may be contributing.
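The Bayesian R² above can be reproduced directly from the fitted model (a sketch; the estimate is draw-based, so exact values vary slightly by seed):
Code
# Posterior summary of Bayesian R^2: Estimate, Est.Error, and 95% credible bounds
bayes_R2(bayes_fit)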
Below, the posterior predictive values from the Bayesian regression are checked against the observed data; odds ratios are reported from the exponentiated coefficients in the Results.
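The checks below use a matrix of posterior predictive draws, pp_samples, whose construction is not shown earlier in this report; a minimal sketch, assuming brms::posterior_predict():
Code
# Rows = posterior draws, columns = observations; entries are simulated 0/1 outcomes
pp_samples <- posterior_predict(bayes_fit)
dim(pp_samples)  # draws x observations, e.g., 4000 x 5592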
Code
# PP check for proportions (useful for binary outcomes)
# Mean comparison: do the simulated means match the observed mean?
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "mean") +
  labs(title = "Posterior Predictive Check: Mean of Replicates") +
  theme_minimal()
Code
## sd
ppc_stat(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ], stat = "sd") +
  labs(title = "PPC: Standard Deviation of Replicates") +
  theme_minimal()
Code
# PP checks with bayesplot options
color_scheme_set("blue")
ppc_scatter_avg(y = adult_imp1$diabetes_dx, yrep = pp_samples[1:100, ]) +
  labs(title = "Observed vs Predicted (Avg) Posterior Predictive")
Comparative Visualizations
- Predicted vs observed: checks how well the model’s predictions align with reality, where mean(y_rep) is the average predicted probability of diabetes for each individual across all posterior draws of the parameters, and y is the actual observed diabetes status (0 = non-diabetic, 1 = diabetic).
- MCMC density plots comparing prior and posterior parameter estimates for bmi_c, age_c, sexFemale, and the race categories.
- Fitted (predicted) vs observed for BMI using points and error bars.
- Fitted (predicted) vs observed for BMI using a line plot.
Code
library(posterior)
library(bayesplot)

# Extract posterior draws as a draws_df (simulate posterior outcomes)
post <- as_draws_df(bayes_fit)

# Check parameter names
colnames(post)
Code
# Density overlay for age, bmi, sex, and race coefficients
mcmc_areas(post, pars = c(
  "b_age_c", "b_bmi_c", "b_sexFemale",
  "b_raceMexicanAmerican", "b_raceOtherHispanic",
  "b_raceNHBlack", "b_raceOtherDMulti"
))
Code
library(dplyr)

# 'predicted' is not defined earlier in this report; it is assumed to come from
# predict(), which for this model returns per-observation probabilities with
# Estimate / Q2.5 / Q97.5 columns (not predicted BMI)
predicted <- predict(bayes_fit)

# Combine observed and predicted into one data frame
plot_data <- adult_imp1 %>%
  mutate(
    predicted_prob = predicted[, "Estimate"],  # predicted probability of diabetes
    lower_ci = predicted[, "Q2.5"],
    upper_ci = predicted[, "Q97.5"],
    obs_index = 1:nrow(adult_imp1)             # index for x-axis
  )

# Line plot (note: observed BMI and predicted probability are on different scales)
ggplot(plot_data, aes(x = obs_index)) +
  geom_line(aes(y = bmi, color = "Observed")) +             # observed BMI
  geom_line(aes(y = predicted_prob, color = "Predicted")) + # predicted probability
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci), alpha = 0.2) +  # uncertainty
  labs(x = "Observation", y = "Value", color = "Legend") +
  theme_minimal()
Code
summary(adult_imp1$bmi)
Min. 1st Qu. Median Mean 3rd Qu. Max.
14.1 24.1 27.7 29.0 32.4 82.9
Code
summary(plot_data$bmi_c)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.09476 -0.69669 -0.19338 -0.01133 0.46371 7.52398
Standardizing BMI - Summary of the original bmi (observed data) and the standardized version of BMI. - Standardizing (centering and scaling) does not change the distribution’s shape; it shifts the mean to zero and rescales to unit SD. - Standardizing is useful in regression/Bayesian models to improve numerical stability and the interpretability of intercepts (see the check after this paragraph). - Plots below show prior vs posterior draws for bmi and age and the proportion of diabetes = 1 for each draw.
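A quick check of that equivalence (a sketch; scale() both centers and rescales, so bmi_c should match a manual z-score, up to small differences if standardization preceded imputation):
Code
# Verify that bmi_c equals (bmi - mean)/sd
with(adult_imp1, all.equal(
  bmi_c,
  as.numeric((bmi - mean(bmi, na.rm = TRUE)) / sd(bmi, na.rm = TRUE))
))

# Mean ~0 and SD ~1 after standardization
c(mean = mean(adult_imp1$bmi_c), sd = sd(adult_imp1$bmi_c))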
Code
prior_summary(bayes_fit)
                prior     class                coef        source
       normal(0, 2.5)         b                               user
       normal(0, 2.5)         b               age_c   (vectorized)
       normal(0, 2.5)         b               bmi_c   (vectorized)
       normal(0, 2.5)         b raceMexicanAmerican   (vectorized)
       normal(0, 2.5)         b         raceNHBlack   (vectorized)
       normal(0, 2.5)         b     raceOtherDMulti   (vectorized)
       normal(0, 2.5)         b   raceOtherHispanic   (vectorized)
       normal(0, 2.5)         b           sexFemale   (vectorized)
  student_t(3, 0, 10) Intercept                               user
(empty columns group/resp/dpar/nlpar/lb/ub/tag omitted)
Summary of the long-format posterior draws (term, estimate, type) produced by the code below:
     term              estimate            type
 Length:8000        Min.   :-3.585220   Length:8000
 Class :character   1st Qu.:-0.697054   Class :character
 Mode  :character   Median : 0.000046   Mode  :character
                    Mean   :-0.007350
                    3rd Qu.: 0.668561
                    Max.   : 3.922817
Code
library(brms)
library(dplyr)
library(tidyr)

# Extract posterior draws for the age and BMI coefficients
post <- as_draws_df(bayes_fit) %>%      # bayes_fit = your brms model
  select(b_bmi_c, b_age_c) %>%          # select your coefficient columns
  pivot_longer(
    everything(),
    names_to = "term",
    values_to = "estimate"
  ) %>%
  mutate(
    term = case_when(
      term == "b_bmi_c" ~ "BMI (per 1 SD)",
      term == "b_age_c" ~ "Age (per 1 SD)"
    ),
    type = "Posterior"
  )

## visualization of prior and posterior draws
combined_draws <- bind_rows(prior_draws, post)

library(ggplot2)
ggplot(combined_draws, aes(x = estimate, fill = type)) +
  geom_density(alpha = 0.4) +
  facet_wrap(~ term, scales = "free", ncol = 2) +
  theme_minimal(base_size = 13) +
  labs(
    title = "Prior vs Posterior Distributions",
    x = "Coefficient estimate",
    y = "Density",
    fill = ""
  )
Code
#### Compute proportion of diabetes=1 for each draw
pp_proportion <- rowMeans(pp_samples)  # proportion of 1's in each posterior draw

# Summary of posterior proportions
summary(pp_proportion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09067 0.10533 0.10944 0.10935 0.11320 0.12715
Code
# Optional: visualize the posterior probability distribution
pp_proportion_df <- tibble(proportion = pp_proportion)
ggplot(pp_proportion_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.01, fill = "skyblue", color = "black") +
  labs(
    title = "Posterior Distribution of Proportion of Diabetes = 1",
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()
Visualization - Prior and predicted draws - Posterior predicted proportion of Diabetes vs NHANES prevalence of Diabetes in the population
Code
library(tidyverse)

# Posterior predicted proportion vector
# pp_proportion <- rowMeans(pp_samples)  # if not already done
known_prev <- 0.089  # NHANES prevalence

# Posterior summary
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# Create a data frame for plotting
pp_df <- tibble(proportion = pp_proportion)

# Plot
ggplot(pp_df, aes(x = proportion)) +
  geom_histogram(binwidth = 0.005, fill = "skyblue", color = "black") +
  geom_vline(xintercept = known_prev, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = posterior_mean, color = "blue", linetype = "solid", size = 1) +
  geom_rect(aes(xmin = posterior_ci[1], xmax = posterior_ci[2], ymin = 0, ymax = Inf),
            fill = "blue", alpha = 0.1, inherit.aes = FALSE) +
  labs(
    title = "Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    subtitle = paste0("Red dashed = NHANES prevalence (", known_prev,
                      "), Blue solid = Posterior mean (", round(posterior_mean, 3), ")"),
    x = "Proportion of Diabetes = 1",
    y = "Frequency"
  ) +
  theme_minimal()
Code
library(dplyr)

# Posterior predicted proportion
posterior_mean <- mean(pp_proportion)
posterior_ci <- quantile(pp_proportion, c(0.025, 0.975))  # 95% credible interval

# NHANES prevalence with SE from survey::svymean
# Suppose you already have:
# svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
known_prev <- 0.089   # mean prevalence
known_se <- 0.0048    # standard error from survey

# Calculate 95% confidence interval
known_ci <- c(
  known_prev - 1.96 * known_se,
  known_prev + 1.96 * known_se
)

# Print results
data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(posterior_mean, known_prev),
  Lower_95 = c(posterior_ci[1], known_ci[1]),
  Upper_95 = c(posterior_ci[2], known_ci[2])
)
Type Mean Lower_95 Upper_95
2.5% Posterior Prediction 0.1093519 0.09781831 0.1212446
NHANES Prevalence 0.0890000 0.07959200 0.0984080
Code
library(ggplot2)
library(dplyr)

# Create a data frame for plotting
ci_df <- data.frame(
  Type = c("Posterior Prediction", "NHANES Prevalence"),
  Mean = c(0.1096674, 0.089),
  Lower_95 = c(0.09772443, 0.079592),
  Upper_95 = c(0.1210658, 0.098408)
)

# Plot
ggplot(ci_df, aes(x = Type, y = Mean, color = Type)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = Lower_95, ymax = Upper_95), width = 0.2) +
  ylim(0, max(ci_df$Upper_95) + 0.02) +
  labs(
    title = "Comparison of Posterior Predicted Diabetes Proportion vs NHANES Prevalence",
    y = "Proportion of Diabetes",
    x = ""
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none")
# Results
# Multiple Logistic Regression Equation
Model:
`svyglm(formula = form_cc, design = des_cc, family = quasibinomial())`
All predictors are significant: age (p < 0.001) and BMI (p < 0.001) show strong positive associations with the outcome, while being female (p = 0.0004) is negatively associated. Other significant associations include raceMexican American (p = 0.0008), raceOther Hispanic (p = 0.0087), raceNH Black (p = 0.0117), and raceOther/Multi (p = 0.0014).
# Multivariate Imputation by Chained Equations
All predictors are statistically significant.
Positive associations: age (p < 0.001), BMI (p < 0.001), and all race categories compared to the reference.
Negative association: being female (p < 0.001).
# Convergence & Diagnostics - Rhat = 1.00 for all parameters → excellent convergence - Bulk_ESS / Tail_ESS: Large values (>2000) → high effective sample sizes, reliable posterior estimates.
# Interpretation
Strong predictors: Age and BMI are strongly positively associated with diabetes risk.
Sex effect: Females have a lower probability of diabetes compared to males
R² = 0.13 shows that 13% of the variance in the outcome (diabetes_dx) is explained by the predictors (age, BMI, sex, race). The 95% credible interval (0.106–0.156) indicates that, given the model and data, the true proportion of variance explained is plausibly between 10.6% and 15.6%, reflecting uncertainty in the explained variance, which is natural for probabilistic models.
Est.Error = 0.0127 reflects the standard deviation of the R² estimate across posterior samples; this small value indicates that the R² estimate is fairly precise.
Race/ethnicity: Mexican American, NH Black, and Other/Multi groups have higher odds of diabetes. The Other Hispanic group has a less certain effect.
age_c: each 1-SD increase in centered age increases the log-odds of diabetes by 1.09 - a strong positive effect.
bmi_c: higher BMI is associated with higher diabetes risk.
sexFemale: females have lower odds of diabetes compared to males.
raceMexicanAmerican: higher odds of diabetes vs. the reference race (NH White).
raceOtherHispanic: slightly higher odds vs. the reference, but the interval crosses zero, so the effect is uncertain.
raceNHBlack: significantly higher odds of diabetes compared to the reference.
raceOtherDMulti: higher odds of diabetes vs. the reference group.
# Posterior Distribution of Model Parameters
Density plots smooth the posterior samples of each parameter (e.g., intercept, slopes) into curves, showing where most of the posterior probability mass lies for best estimates and their uncertainty.
# Posterior Predictive Distribution - generated from posterior predictive draws: y_rep ∼ p(y_new | θ), simulating new data given the posterior parameter estimates. - Posterior predictive checks (PPC) compare these simulations to the real data to assess model fit.
# Incorporating two sources of uncertainty: Parameter uncertainty: captured in the posterior distributions Predictive uncertainty: captured in posterior predictive draws
Combining the two provides credible intervals for predictions, not just point predictions, supporting statements such as: given this BMI, the probability of diabetes is likely between 40% and 55%.
# Autocorrelation Interpretation - Autocorrelation plots show the correlation of a sample with previous samples (lags) in the MCMC chain; lag indicates the number of steps between samples, and lag 0 is always 1 (perfect correlation with itself). - Each chain (1–4), shown separately for b_age_c (age coefficient) and b_bmi_c (BMI coefficient), shows autocorrelation dropping quickly to near zero after lag 1–2 for both coefficients, suggesting good mixing: successive samples are mostly independent after a short lag. - No persistent high autocorrelation indicates the MCMC chains are converging well. - Low autocorrelation, as seen here, is desirable because it means posterior estimates are reliable and not biased by chain dependence.
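A sketch of how these autocorrelation plots could be produced with bayesplot (the lag range is illustrative):
Code
library(bayesplot)
library(posterior)

# Autocorrelation by chain for the age and BMI coefficients
mcmc_acf(
  as_draws_array(bayes_fit),
  pars = c("b_age_c", "b_bmi_c"),
  lags = 20
)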
# Comparing Models
All three models (survey-weighted MLE, multiple imputation, Bayesian) agree closely on the direction and magnitude of the effects of BMI and age.
Age is a stronger predictor than BMI, nearly tripling the odds per 1 SD.
BMI significantly increases diabetes risk (~1.7–1.9× per 1 SD).
Differences between models are minor, indicating robust and reliable findings despite missing data or modeling approach.
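The odds-ratio statements above follow from exponentiating the posterior coefficient summaries; a minimal sketch:
Code
# Odds ratios with 95% credible bounds (Q2.5 / Q97.5 columns of fixef());
# the Est.Error column is not meaningful after exponentiation, so it is dropped
exp(fixef(bayes_fit)[, c("Estimate", "Q2.5", "Q97.5")])
# e.g., exp(1.09) ≈ 2.97 per 1 SD of age; exp(0.63) ≈ 1.88 per 1 SD of BMI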
# Diabetes Predicted proportion vs population prevalence - Posterior predicted proportion is slightly higher than the observed NHANES prevalence (10.97% vs 8.9%). - Intervals overlap slightly, but the posterior tends to overestimate diabetes compared to NHANES.
The difference could be due to: - Model assumptions (logistic link, priors) - Predictor effects (BMI, age, sex, race) - Sample characteristics vs population weighting in NHANES
# Conclusion
The results suggest the model is reasonable but slightly conservative, predicting somewhat higher risk relative to the observed population prevalence.
Across multiple modeling approaches - survey-weighted maximum likelihood, multiple imputation, and Bayesian regression - both age and BMI were consistently strong predictors of diabetes. Each standard deviation increase in age nearly tripled the odds of diabetes, while a similar increase in BMI elevated the odds by approximately 1.7–1.9 times. The consistency of these results across models highlights the robustness of the associations and underscores the importance of age and BMI as key risk factors for diabetes in this population.
Effect stability: point estimates from the Bayesian model closely aligned with those from the frequentist model, indicating that prior regularization stabilized the estimates in the presence of modest missingness.
Uncertainty quantification: Bayesian credible intervals for the odds ratios were slightly narrower yet overlapped the frequentist confidence intervals, suggesting comparable inferential precision with improved interpretability.
Design considerations:
# Survey-weighted MLE (Maximum Likelihood Estimation)
- incorporates each observation weighted according to its survey weight
- provides estimates that reflect population-level parameters, not just the sample; produces population-representative estimates
# Bayesian Model with Normalized Weights
- instead of fully modeling the survey design, it used normalized sampling weights as importance weights
- scaling the weights to sum to the sample size approximates the effect of survey weights, but does not fully account for stratification, clustering, or design-based variance adjustments
- Bayesian inference treats the weighted likelihood as if it came from a regular model, ignoring some survey design features
# Discussions
The use of multiple imputation allowed for robust analysis despite missing data, increasing power and reducing bias. Comparison of frequentist and Bayesian models demonstrated consistency in significant predictors, while Bayesian approaches provided the advantage of posterior distributions and probabilistic interpretation. Across all models, both age and BMI emerged as strong and consistent predictors of diabetes; this consistency across modeling approaches strengthens the validity of the findings. Multiple imputation accounted for potential biases due to missing data, and Bayesian modeling provided robust credible intervals that closely matched frequentist estimates. These results align with previous epidemiological research indicating that increasing age and higher BMI are among the most important determinants of type 2 diabetes risk; cumulative exposure to metabolic and lifestyle risk factors over time, together with excess adiposity and insulin-related effects, accounts for much of this risk. The survey-weighted dataset strengthened population representativeness, multiple imputation handled missing data, and rigorous Bayesian estimation yielded high effective sample sizes, with R̂ ≈ 1.00 across parameters confirming excellent model convergence. Bayesian logistic regression provided statistically consistent and interpretable inference, achieving the aim of this study. Future research could fit hierarchical models across NHANES cycles or add variables (e.g., lab tests) to assess nonlinear effects of metabolic risk factors.
# Limitations
The study is based on cross-sectional/observational NHANES data, which limits the ability to make causal inferences. Associations observed between BMI, age, and diabetes status cannot confirm causation.
The project relies on multiple imputation for missing values; although imputation reduces bias, it assumes data are missing at random (MAR). If data are missing not at random (MNAR), results may be biased.
Potential Residual Confounding - Models included key predictors (age, BMI, sex, race), but unmeasured factors like diet, physical activity, socioeconomic status, or genetic predisposition could confound associations.
Bayesian models depend on prior choices, which could influence posterior estimates if priors are informative or mis-specified.
Variable Measurement - BMI is measured at a single time point, which may not reflect long-term exposure or risk.
Self-reported variables are subject to recall or reporting bias.
Interactions were not tested in the study (e.g., BMI × age), so potential synergistic effects might be missed.
Predicted probabilities are model-based estimates, not validated for clinical decision-making. External validation in independent cohorts is needed before application.
# QUESTION for targeted therapy
Translational Perspective from the Bayesian Diabetes Prediction Project. This project further demonstrates the translational potential of Bayesian modeling in clinical decision-making and public health strategy. By using patient-level predictors such as age, BMI, sex, and race to estimate the probability of diabetes, the model moves beyond descriptive statistics toward individualized risk prediction. The translational move lies in converting these probabilistic outputs into actionable thresholds—such as identifying the BMI or age at which the predicted risk of diabetes exceeds a clinically meaningful level (e.g., 30%). Such insights can guide early screening, personalized lifestyle interventions, and targeted prevention programs for populations at higher risk. This approach embodies precision public health—bridging data science and medical decision-making to deliver tailored, evidence-based strategies that can ultimately improve diabetes prevention and management outcomes.
What changes in modifiable predictors would lower diabetes risk?
# Translational Research Implications: - We now use the model to guide prevention or intervention. - Only BMI is a modifiable risk factor here. - What must change in BMI, behavior, or lifestyle to achieve a lower risk threshold? In practice, we hold non-modifiable predictors constant (sex, race) and vary the modifiable predictor (BMI) until the model predicts the desired probability.
# Below we used the data of the first two participants from the same dataset (internal validation, as the data come from the same dataset). # Next we used simulated data for a new participant (not in the dataset) for model validation (external validation, as the data are out-of-sample). - This means applying the trained model to a specific individual’s data (a new participant), using posterior draws to estimate predictive uncertainty (a credible interval). - It tests generalizability: how well the model performs on new, independent data.
This approach bridges individual-level predictions with population-level evidence.
Predicted probabilities of diabetes for an individual participant, obtained using the fitted Bayesian logistic regression model, account for model uncertainty, and a 95% credible interval represents the range within which the true probability of diabetes for the participant is expected to lie.
Using this approach, researchers can better understand the uncertainty and variability in clinical risk predictions, allowing more personalized and data-informed decision-making and helping translate statistical models into clinically actionable insights.
Such probabilistic predictions can help identify individuals at higher risk and can guide targeted screening, preventive interventions, and policy decisions, ultimately supporting precision medicine and improving patient outcomes.
Code
# Use the first participant (using multiple covariates to select someone)
participant1_data <- adult[1, ]

# Predicted probabilities for participant 1
phat1 <- posterior_linpred(bayes_fit, newdata = participant1_data, transform = TRUE)
# 'transform = TRUE' gives probabilities for logistic regression

# Store in a data frame for plotting
post_pred_df <- data.frame(pred = phat1)

# Compute 95% credible interval
ci_95_participant1 <- quantile(phat1, c(0.025, 0.975))

# Plot
ggplot(post_pred_df, aes(x = pred)) +
  geom_density(color = 'darkblue', fill = 'lightblue') +
  geom_vline(xintercept = ci_95_participant1[1], color = 'red', linetype = 'dashed') +
  geom_vline(xintercept = ci_95_participant1[2], color = 'red', linetype = 'dashed') +
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
  theme_bw()
Code
participant2_data <- adult[2, ]

# Predicted probabilities for participant 2
phat2 <- posterior_linpred(bayes_fit, newdata = participant2_data, transform = TRUE)
# 'transform = TRUE' gives probabilities for logistic regression

# Store in a data frame for plotting
post_pred_df2 <- data.frame(pred = phat2)

# Compute 95% credible interval
ci_95_participant2 <- quantile(phat2, c(0.025, 0.975))

# Plot
ggplot(post_pred_df2, aes(x = pred)) +
  geom_density(color = 'darkblue', fill = 'lightblue') +
  geom_vline(xintercept = ci_95_participant2[1], color = 'red', linetype = 'dashed') +
  geom_vline(xintercept = ci_95_participant2[2], color = 'red', linetype = 'dashed') +
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution 95% Credible Interval') +
  theme_bw()
Code
library(ggplot2)

new_participant <- data.frame(
  age_c = 40,   # NOTE: age_c and bmi_c are standardized (z-scores); raw values
  bmi_c = 25,   # like 40 and 25 lie far outside the observed range and push the
  sex = "Female",          # predicted probability to ~1, as the output shows
  race = "Mexican American"
)

# Posterior predicted probabilities
phat_new <- posterior_linpred(bayes_fit, newdata = new_participant, transform = TRUE)

# Convert to numeric vector
phat_vec <- as.numeric(phat_new)

# Check the range to see if all values are similar
range(phat_vec)
[1] 1 1
Code
# Store in a data frame
post_pred_df_new <- data.frame(pred = phat_vec)

# Compute 95% credible interval from the vector
ci_95_new_participant <- quantile(phat_vec, c(0.025, 0.975))

# Plot
ggplot(post_pred_df_new, aes(x = pred)) +
  geom_density(color = 'darkblue', fill = 'lightblue', alpha = 0.6) +
  geom_vline(xintercept = ci_95_new_participant[1], color = 'red', linetype = 'dashed') +
  geom_vline(xintercept = ci_95_new_participant[2], color = 'red', linetype = 'dashed') +
  xlim(0, 1) +  # ensures you see the curve even if values are close
  xlab('Probability of being diabetic (Outcome=1)') +
  ggtitle('Posterior Predictive Distribution (95% Credible Interval)') +
  theme_bw()
# Below is the targeted BMI that we calculated. In this analysis, a grid of body mass index (BMI) values is generated to examine the relationship between BMI and the predicted probability of diabetes while holding the other covariates (age, sex, and race) constant. - Using the fitted Bayesian logistic regression model, posterior predictive probabilities are computed for each BMI value, and the mean of the posterior draws estimates the average predicted probability of diabetes across the BMI range. - We then identify the BMI value whose predicted probability is closest to a predefined target probability (here, 0.3). - This enables a clinically meaningful interpretation of the model: determining the BMI level associated with a 30% predicted probability of diabetes for a given demographic profile. - This links statistical inference to practical health thresholds relevant for risk communication and prevention strategies in translational research.
Code
# Grid of possible BMI values (centered if model used bmi_c)
# NOTE: as above, the model expects standardized bmi_c, so feeding raw BMI
# values (18-40) together with age_c = 40 drives all predicted probabilities
# to ~1 (see the output below)
bmi_seq <- seq(18, 40, by = 0.5)
newdata_grid <- data.frame(
  age_c = 40,
  bmi_c = bmi_seq,
  sex = "Female",
  race = "Mexican American"
)

# Posterior mean predicted probabilities
pred_probs <- posterior_linpred(bayes_fit, newdata = newdata_grid, transform = TRUE)

# Average over posterior draws to get the mean predicted probability per BMI
prob_mean <- colMeans(pred_probs)

# Combine with BMI values
pred_df <- cbind(newdata_grid, prob_mean)

target_prob <- 0.3
closest <- pred_df[which.min(abs(pred_df$prob_mean - target_prob)), ]
closest
age_c bmi_c sex race prob_mean
1 40 18 Female Mexican American 1
Code
ggplot(pred_df, aes(x = bmi_c, y = prob_mean)) +
  geom_line(color = "darkblue", linewidth = 1.2) +
  geom_hline(yintercept = target_prob, color = "red", linetype = "dashed") +
  geom_vline(xintercept = closest$bmi_c, color = "red", linetype = "dotted") +
  annotate("text", x = closest$bmi_c, y = target_prob + 0.05,
           label = paste0("Target BMI ≈ ", round(closest$bmi_c, 1)),
           color = "red", hjust = -0.1) +
  labs(
    x = "BMI (centered or raw, depending on model)",
    y = "Predicted Probability of Diabetes",
    title = "Inverse Prediction: BMI Needed for Target Diabetes Risk"
  ) +
  theme_bw()
# Implications
Age and BMI are robust and independent predictors of diabetes, underscoring the importance of early targeted interventions in mitigating diabetes risk.
Longitudinal studies, and combining other statistical analytical methods with Bayesian approaches, could further enhance and better inform precision prevention strategies.
# References
Austin, Peter C., Ian R. White, Douglas S. Lee, and Stef van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.”Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.”Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Center for Health Statistics, National. 1999. “Vital and Health Statistics Reports Series 2, Number 161, September 2013.”National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.”Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A tutorial on Bayesian Normal linear regression.”Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Leeuw, Christiaan de, and Irene Klugkist. 2012. “Augmenting Data With Published Results in Bayesian Linear Regression.”Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Yi Ming, Sam Li Sheng Chen, Amy Ming Fang Yen, and Hsiu Hsi Chen. 2013. “Individual risk prediction model for incident cardiovascular disease: A Bayesian clinical reasoning approach.”International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/J.IJCARD.2012.05.016.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian statistics and modelling.”Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Schoot, Rens Van De, David Kaplan, Jaap Denissen, Jens B Asendorpf, and Marcel A G Van Aken. 2013. “A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research.”https://doi.org/10.1111/cdev.12169.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Vol. 10. Boca Raton, FL: CRC Press.